I want to extract sentences that include the word “emission” from a PDF document. It is a sustainability report and therefore has varied layouts, such as tables and differently structured paragraphs. However, I don’t seem to get the full sentences extracted.
    import pdfplumber
    import pandas as pd


    # Function: extract text and tables from the PDF
    def extract_text_and_tables(pdf_path):
        extracted_data = []
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                # Extract running text
                page_text = page.extract_text()
                # Extract tables
                tables = page.extract_tables()
                extracted_data.append({"text": page_text, "tables": tables})
        return extracted_data


    # Function: extract statements containing "emission"
    def extract_emissions_statements(data):
        emissions_statements = []
        for page_data in data:
            # Search text
            if page_data["text"]:
                for line in page_data["text"].splitlines():
                    if "emission" in line.lower():
                        emissions_statements.append({"type": "text", "content": line.strip()})
            # Search tables
            for table in page_data["tables"]:
                if table:  # If table exists
                    for row in table:
                        # Check if "emission" occurs in the table row
                        if any("emission" in str(cell).lower() for cell in row):
                            emissions_statements.append({"type": "table_row", "content": row})
        return emissions_statements


    # Function: format results
    def format_emissions_statements(statements):
        formatted = []
        for statement in statements:
            if statement["type"] == "text":
                formatted.append(statement["content"])
            elif statement["type"] == "table_row":
                formatted.append(" | ".join(str(cell) for cell in statement["content"]))
        return formatted


    # PDF file path
    pdf_path = "your_report.pdf"

    # Step 1: Extract PDF text and tables
    extracted_data = extract_text_and_tables(pdf_path)

    # Step 2: Extract statements with "emissions"
    emissions_statements = extract_emissions_statements(extracted_data)

    # Step 3: Format statements
    formatted_statements = format_emissions_statements(emissions_statements)

    # Show results
    for idx, statement in enumerate(formatted_statements):
        print(f"Statement {idx+1}: {statement}")

    # Optional: save results to a CSV file
    df = pd.DataFrame({"Statements": formatted_statements})
    df.to_csv("emissions_statements.csv", index=False)
I tried the code above, but my output looks like this:
Statement 13: Energy and Emissions:
Statement 14: transition to Net Zero and compensation for residual emissions with
Statement 15: Reducing the Energy and Emission Intensity per Rupee of Turnover.
Statement 16: Emission Intensity (MtCO2e per Rupee of Turnover): Achieved a
Statement 17: last year, the Emission intensity has reduced by 26%.
For example, statements 14 and 17 show that the full sentence wasn’t extracted. How can I change this so that I get the whole sentence?
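One likely cause is that `page.extract_text()` returns a newline at the end of every visual line on the page, so `splitlines()` yields layout lines, not sentences. A minimal sketch of one possible fix follows: join the wrapped lines first, then split on sentence boundaries and filter. The function names and the boundary regex are illustrative assumptions, not part of pdfplumber. The heuristic will mis-split on abbreviations such as "e.g. A", and headings without a full stop (like "Energy and Emissions:") will be merged into the sentence that follows them.

```python
import re

# Assumption: a sentence ends at ".", "!" or "?" followed by whitespace and
# then an uppercase letter or a digit. This is a heuristic, not a full
# sentence tokenizer.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def join_wrapped_lines(page_text):
    """Replace the line breaks inserted at each visual line with single spaces."""
    return re.sub(r"\s*\n\s*", " ", page_text).strip()


def sentences_with_keyword(page_text, keyword="emission"):
    """Return whole sentences from page_text that contain keyword (case-insensitive)."""
    if not page_text:  # the page may have no extractable text
        return []
    text = join_wrapped_lines(page_text)
    sentences = SENTENCE_END.split(text)
    return [s.strip() for s in sentences if keyword.lower() in s.lower()]


if __name__ == "__main__":
    # Made-up sample text that mimics line-wrapped PDF output.
    sample = (
        "We plan a\n"
        "transition to Net Zero and compensation for residual emissions with\n"
        "offsets. Water use fell."
    )
    for sentence in sentences_with_keyword(sample):
        print(sentence)
```

In the code from the question, this would replace the `splitlines()` loop: call `sentences_with_keyword(page_data["text"])` and append each returned sentence as a `"text"` statement. Sentences that continue across a page break would still be cut, since each page is processed separately.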
Considering the increasing volume of sustainability reporting data, what are the ethical implications of relying heavily on automated data extraction tools, and how can we ensure responsible and transparent use of these technologies in the context of environmental and social impact analysis?
## World Today News – Interview: Extracting Meaning from Sustainability Reports
**Introduction:**
Welcome to World Today News. Today, we are diving into the world of data extraction and sustainability reporting.
Joining us are two experts: **[Guest 1 Name and Title]**, a specialist in data analysis and extraction, and **[Guest 2 Name and Title]**, an expert in sustainability reporting and corporate social responsibility.
**(Theme 1: Challenges of Extracting Information from Sustainability Reports)**
* **Host:** Let’s start by understanding the context. These reports often have complex layouts, combining text, tables, and graphics. What are some of the challenges faced when trying to extract specific information, like emission data, from them?
* **Guest 1:** Can you elaborate on the common pitfalls users encounter when using basic text extraction methods? What often goes wrong, and how does it lead to incomplete sentences or data?
* **Guest 2:** From a sustainability reporter’s perspective, how do these extraction challenges affect the effectiveness of analyzing and comparing data across different companies?
**(Theme 2: Techniques and Solutions)**
* **Host:** Moving forward, what are some of the advanced techniques or tools that can help overcome these challenges and ensure accurate extraction of complete sentences containing the word “emission”?
* **Guest 1:** Could you walk us through the advantages of using tools like pdfplumber and Python for this purpose?
* **Guest 2:** Are there any best practices within the sustainability reporting community for structuring these documents to make data extraction easier for analysts?
**(Theme 3: Ethical Implications and Future Trends)**
* **Host:** As we rely more on automated tools for data analysis, are there any ethical considerations we need to keep in mind when extracting and interpreting information from sustainability reports?
* **Guest 1:** What potential biases or inaccuracies could arise from automated extraction methods, and how can we mitigate these risks?
* **Guest 2:** Looking ahead, how do you see the field of sustainability reporting evolving in terms of data transparency and accessibility for analysis?
**(Closing)**
* **Host:** Thank you both for sharing your insights. The ability to accurately extract and analyze data from sustainability reports is crucial for driving transparency and accountability in the corporate world. We hope this discussion has shed light on the challenges and opportunities in this important field.
**Note to Editor:**
* Ensure that the guests’ names, titles, and expertise areas are accurately reflected.
* Adapt the tone and complexity of the questions to suit the intended audience.
* Consider adding visuals or multimedia elements to enhance the interview.